Recap
Inference
Least Squares as a Statistical Model
Least Squares
So far, we…
…start with a vector \(\mathbf{Y} = (Y_1 \ldots Y_i \ldots Y_n)\)
… want to produce a vector of predicted values \(\mathbf{\widehat{Y}} = (\widehat{Y_1} \ldots \widehat{Y_i} \ldots \widehat{Y_n})\)
… define \(\mathbf{\widehat{Y}} = \mathbf{X}\mathbf{\beta}\):
Then, we…
… choose \(\beta\) such that \(\mathbf{\widehat{Y}}\) has minimum (Euclidean) distance to \(\mathbf{Y}\):
The squared length of the residual vector \(\mathbf{e} = \mathbf{Y} - \mathbf{\widehat{Y}}\), \(\mathbf{e}'\mathbf{e}\), is minimized when \(\mathbf{e}'\mathbf{X} = \mathbf{0}\): residuals are orthogonal to \(\mathbf{X}\).
This gave us \(\mathbf{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
This definition of least squares means that:
residuals \(\mathbf{e}\) are orthogonal to \(\mathbf{X}\); so if there is an intercept, the mean of the residuals is \(0\), and covariance of residuals and \(\mathbf{X}\) is \(0\).
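Both facts can be checked numerically. Below is a minimal sketch in Python/NumPy (not the course's R code) on simulated data; all names and numbers are illustrative:

```python
# Compute beta = (X'X)^{-1} X'Y directly and verify the orthogonality facts.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)       # simulated outcome

beta = np.linalg.solve(X.T @ X, X.T @ Y)                # least-squares coefficients
e = Y - X @ beta                                        # residuals

print(np.allclose(e @ X, 0))    # True: residuals are orthogonal to X
print(np.isclose(e.mean(), 0))  # True: with an intercept, residuals have mean 0
```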
If we just fit an intercept (same value of \(\widehat{Y}\) for all cases): what is the mean of the residuals \(E(Y_i - \widehat{Y_i})\)?
Let’s say we want to get the conditional expectation function for yearly earnings of professionals as a function of hours worked, profession, gender, and age.
Using data from the American Community Survey, we can look at how hours worked is related to earnings for doctors and lawyers.
UHRSWORK (the usual hours worked per week)
FEMALE (an indicator for female)
AGE (in years)
LAW (an indicator for being a lawyer)
INCEARN (total annual income earned in dollars)
Let’s now find the linear conditional expectation function of earnings, across hours worked, age, profession, and gender:
\(Earnings_i = b_0 + b_1 Hours_i + b_2 Age_i + b_3 Law_i + b_4 Female_i\)
##
## Call:
## lm(formula = INCEARN ~ UHRSWORK + AGE + LAW + FEMALE, data = acs_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -304486 -90963 -29238 72845 743690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11083.5 9484.6 1.169 0.243
## UHRSWORK 1075.5 114.7 9.373 <2e-16 ***
## AGE 3452.8 125.2 27.578 <2e-16 ***
## LAW -48919.9 2674.6 -18.291 <2e-16 ***
## FEMALE -37223.1 2799.8 -13.295 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 127800 on 9995 degrees of freedom
## Multiple R-squared: 0.146, Adjusted R-squared: 0.1457
## F-statistic: 427.2 on 4 and 9995 DF, p-value: < 2.2e-16
How do we make sense of, e.g., the slope on UHRSWORK?
How could we change our regression equation to make the slope on hours worked easier to interpret?
Angrist and Pischke, MHE, pp. 77-80
We want to interpret the \(\beta\) vector causally.
There is a true \(\beta_D\) (\(\delta\)) that is the effect of some \(D\) on \(Y\).
\(\beta_D\) is an unknown parameter.
To estimate \(\beta_D\), we need a statistical model for least squares.
As with experiments (e.g., the Neyman causal model), statistical models encode assumptions about the process that generated the data.
We need to clarify what work is being done by the mathematical assumptions (least squares) and what work is being done by the statistical assumptions (the model).
Hooke’s law is a law of physics that states that the force (\(D\)) needed to extend or compress a spring by some distance \(y\) increases linearly with respect to the distance.
Thus, for a given weight \(D_i\) on a spring, the length \(Y_i\) is determined by:
\(Y_i = \alpha + \beta D_i + \epsilon_i\) for \(i\) in \(1 \ldots n\)
This is a statistical model:
We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)
Imagine we have a spring whose length is 30 cm. With each Newton of force applied, the spring increases by 0.5 cm.
Using these facts and the statistical model described above: create a response schedule/potential outcomes table corresponding to the spring under loads of \(0,1,2,3,5,8\) Newtons.
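As a warm-up (a sketch, not the response-schedule answer itself), the model can be simulated in Python under the stated facts, \(\alpha = 30\) cm and \(\beta = 0.5\) cm per Newton; the error scale \(\sigma\) is an assumption chosen for illustration:

```python
# Simulate Y_i = alpha + beta * D_i + eps_i for the spring example.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta, sigma = 30.0, 0.5, 0.1            # sigma = 0.1 cm is an assumed value
D = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 8.0])   # loads in Newtons

det = alpha + beta * D                          # deterministic part of the model
Y = det + rng.normal(0.0, sigma, size=D.size)   # one random realization

print(det)  # deterministic lengths: 30, 30.5, 31, 31.5, 32.5, 34 (cm)
```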
The statistical model assumes that this linear equation describes the process generating the data.
\[Y_i = \alpha + \beta D_i + \epsilon_i\]
deterministic (not random) \(\alpha + \beta D_i\)
stochastic (random error) \(\epsilon_i\)
\(\alpha\): parameter for the intercept (fixed for all cases)
\(\beta\): parameter for the slope (fixed for all cases)
\(D_i\): value of independent variable (input into the equation)
\(Y_i\): value of dependent variable (determined by the equation)
\(\epsilon_i\): random error for observation \(i\)
The more general regression model (where \(\mathbf{X_i}\) is row \(i\) of the \(n \times p\) matrix of variables \(\mathbf{X}\))
\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]
can be estimated with least squares:
\[Y_i = X_i\hat{\beta} + e_i\]
where the residuals are used to estimate the variance of the errors: \(Var(e) = \widehat{\sigma^2}\)
Differences from Least Squares Algorithm
Least Squares is the Best Linear Unbiased Estimator of our model parameters under 3 key assumptions.
\((1.)\) Model equation is correctly specified
In groups: come up with 2 ways this assumption could be wrong in the context of this model:
\(Earnings_i = b_0 + b_1 Hours_i + b_2 Age_i + b_3 Law_i + b_4 Female_i + \epsilon_i\)
\((2.)\) \(\epsilon_i\) are independent, identically distributed (i.i.d.)
In groups: come up with 2 ways this assumption could be wrong in the context of the Hooke’s Law example.
\((3.)\) \(\epsilon_i\) is independent of \(X_i\): \(X_i \perp \!\!\! \perp \epsilon_i\)
In groups: come up with 2 ways this assumption could be wrong in the context of this model:
\(Earnings_i = b_0 + b_1 Hours_i + b_2 Age_i + b_3 Law_i + b_4 Female_i + \epsilon_i\)
Under these three assumptions, we call the model “ordinary least squares”.
We cannot empirically verify these assumptions.
In particular, we cannot verify assumption 3:
If these assumptions are reasonable for the data we observe, then with large \(n\), we can estimate our parameters \(\beta, \sigma^2\) using least squares.
And \(\beta\) could have a causal interpretation.
What work is done by
Mathematical Assumptions (least squares):
Statistical Assumptions (statistical model):
If \(\mathbf{X}\) is an \(n \times p\) matrix, we want to estimate \(p + 1\) parameters… (what are they?)
\(\beta\) estimated by \(\widehat{\beta}\) using least squares:
\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
\(Y_i = X_i\beta + \epsilon_i\)
\(\epsilon_i\) is a random variable.
If we were to repeat the drawing of \(\epsilon\), we would get a different estimate \(\widehat{\beta}\).
Under the statistical assumption we made, \(\widehat{\beta}\) is an unbiased estimator of \(\beta\), conditional on \(X\): \(E(\widehat{\beta}|X) = \beta\)
By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
So:
\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)
And because, by assumption, \(X \perp \!\!\! \perp \epsilon\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)
And because, by assumption, \(E(\epsilon) = 0\)
\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of causal parameter \(\beta\)
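This algebra can also be checked by simulation: hold \(\mathbf{X}\) fixed, redraw \(\epsilon\) many times, and average the resulting \(\widehat{\beta}\). A sketch in Python/NumPy under the model's assumptions (all numbers illustrative):

```python
# Monte Carlo check of E(beta_hat | X) = beta.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000
beta = np.array([1.0, 2.0])                            # true parameters
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # X held fixed across draws

est = np.empty((reps, 2))
for r in range(reps):
    eps = rng.normal(size=n)                 # fresh error draw each repetition
    Y = X @ beta + eps
    est[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(est.mean(axis=0))  # average estimate is close to [1, 2]
```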
How do we discover the uncertainty in our estimate \(\widehat{\beta}\)?
\[\scriptsize{\begin{pmatrix} Var(\widehat{\beta}_1) & Cov(\widehat{\beta}_1,\widehat{\beta}_2) & Cov(\widehat{\beta}_1,\widehat{\beta}_3) &\ldots & Cov(\widehat{\beta}_1,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_2,\widehat{\beta}_1) & Var(\widehat{\beta}_2) & Cov(\widehat{\beta}_2,\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_2,\widehat{\beta}_p) \\ Cov(\widehat{\beta}_3,\widehat{\beta}_1) & Cov(\widehat{\beta}_3,\widehat{\beta}_2) & Var(\widehat{\beta}_3) & \ldots & Cov(\widehat{\beta}_3,\widehat{\beta}_p) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\widehat{\beta}_p,\widehat{\beta}_1) & Cov(\widehat{\beta}_p,\widehat{\beta}_2) & Cov(\widehat{\beta}_p, \widehat{\beta}_3) & \ldots & Var(\widehat{\beta}_p)\end{pmatrix}}\]
How do we use the variance-covariance matrix?
The square root of the diagonal elements (variances) gives the standard error for each parameter estimate in \(\widehat{\beta}\) (hypothesis testing)
The off-diagonal elements help answer questions like: is \(\beta_2 + \beta_3 \neq 0\)? We need \(Cov(\widehat{\beta}_2, \widehat{\beta}_3)\) to get \(Var(\widehat{\beta}_2 + \widehat{\beta}_3)\). (complex hypothesis testing, e.g., interaction effects)
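Written out, the needed identity is the usual variance-of-a-sum formula:

\[Var(\widehat{\beta}_2 + \widehat{\beta}_3) = Var(\widehat{\beta}_2) + Var(\widehat{\beta}_3) + 2\,Cov(\widehat{\beta}_2, \widehat{\beta}_3)\]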
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:
\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
What really matters here is: \(E(\epsilon\epsilon'|X)\)
All variance-covariance matrices are “sandwiches”:
\[ (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is the bread; \(E(\epsilon\epsilon'|X)\) is the meat.
Later on, we can make different choices of “meat”.
In groups: if \(\epsilon\) is a vector that is \(n \times 1\) with elements \((\epsilon_1 \ldots \epsilon_n)\)
What are the dimensions of the matrix: \(\epsilon\epsilon'\)?
What are the elements on the diagonal of the matrix?
What is the expected value of the elements on the off-diagonal?
Hints: We have assumed that \(E(\epsilon) = 0\); \(\epsilon_i\) are independent of each other
\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]
\(\epsilon\epsilon'\) is \(n \times n\) matrix with \(ij\)th elements \(\epsilon_i\epsilon_j\)
Taking the expectation:
The result is a \(n \times n\) matrix, with \(\sigma^2\) on the diagonal, \(0\) everywhere else.
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]
This gives us the variance-covariance matrix of \(\widehat{\beta}\). Can we start computing standard errors for \(\widehat{\beta}\) now?
We use residuals to stand in for \(\epsilon\).
\[\widehat{\sigma^2} = \frac{\sum\limits_i^n e_i^2}{n - p}\]
\(E[\widehat{\sigma^2} | X] = \sigma^2\) (unbiased)
so, we estimate the variance-covariance matrix:
\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)
The variance (\(p\)th diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is
\(Var(\hat{\beta_p}) = \frac{\hat{\sigma^2}}{\sum\limits_i^n (X^*_{ip} - \overline{X_p^*})^2}\)
So the standard error is:
\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)
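Putting the estimation steps together numerically — a sketch in Python/NumPy with simulated data (names and numbers are illustrative); under the model assumptions, these standard errors come from \(\widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\):

```python
# Estimate sigma^2 from residuals, form the variance-covariance matrix,
# and take square roots of the diagonal to get standard errors.
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
sigma2_hat = (e @ e) / (n - p)               # sum of e_i^2 over (n - p)
vcov = sigma2_hat * np.linalg.inv(X.T @ X)   # estimated cov(beta_hat)
se = np.sqrt(np.diag(vcov))                  # standard errors of beta_hat

print(se)
```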
asymptotic normality:
Even if \(\epsilon\) is not normally distributed, under our assumptions the sampling distribution of \(\hat{\beta}\) converges to a normal distribution as \(n\to\infty\), by the central limit theorem. This approximation is reasonable if \(n\) is moderately large and the data are not highly skewed.
Because we estimate the standard errors, the \(t\) statistic for hypothesis testing of \(\widehat{\beta}\) is asymptotically \(t\) distributed:
In sum:
For standard errors of \(\widehat{\beta}\) to be correct, we must assume:
If any assumption is wrong: standard errors will be incorrect.
If assumption 1 or assumption 3 is wrong: unbiasedness of \(\widehat{\beta}\) does not hold.
In groups:
What is an example of the assumption being wrong?
Why would it lead our standard errors to be wrong?
When do things go wrong?
With these assumptions: